Lichess.org Dataset

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

df = pd.read_csv("chess.csv")
# Replace the boolean flag with readable labels in a single step, so the
# column never holds a mix of booleans and strings.
df["rated"] = df["rated"].map({True: "Rated", False: "Unrated"})
df
Out[1]:
id rated created_at last_move_at turns victory_status winner increment_code white_id white_rating black_id black_rating moves opening_eco opening_name opening_ply
0 TZJHLljE Unrated 1.504210e+12 1.504210e+12 13 outoftime white 15+2 bourgris 1500 a-00 1191 d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5... D10 Slav Defense: Exchange Variation 5
1 l1NXvwaE Rated 1.504130e+12 1.504130e+12 16 resign black 5+10 a-00 1322 skinnerua 1261 d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6... B00 Nimzowitsch Defense: Kennedy Variation 4
2 mIICvQHh Rated 1.504130e+12 1.504130e+12 61 mate white 5+10 ischia 1496 a-00 1500 e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc... C20 King's Pawn Game: Leonardis Variation 3
3 kWKvrqYL Rated 1.504110e+12 1.504110e+12 61 mate white 20+0 daniamurashov 1439 adivanov2009 1454 d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O... D02 Queen's Pawn Game: Zukertort Variation 3
4 9tXo1AUZ Rated 1.504030e+12 1.504030e+12 95 mate white 30+3 nik221107 1523 adivanov2009 1469 e4 e5 Nf3 d6 d4 Nc6 d5 Nb4 a3 Na6 Nc3 Be7 b4 N... C41 Philidor Defense 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20053 EfqH7VVH Rated 1.499791e+12 1.499791e+12 24 resign white 10+10 belcolt 1691 jamboger 1220 d4 f5 e3 e6 Nf3 Nf6 Nc3 b6 Be2 Bb7 O-O Be7 Ne5... A80 Dutch Defense 2
20054 WSJDhbPl Rated 1.499698e+12 1.499699e+12 82 mate black 10+0 jamboger 1233 farrukhasomiddinov 1196 d4 d6 Bf4 e5 Bg3 Nf6 e3 exd4 exd4 d5 c3 Bd6 Bd... A41 Queen's Pawn 2
20055 yrAas0Kj Rated 1.499698e+12 1.499698e+12 35 mate white 10+0 jamboger 1219 schaaksmurf3 1286 d4 d5 Bf4 Nc6 e3 Nf6 c3 e6 Nf3 Be7 Bd3 O-O Nbd... D00 Queen's Pawn Game: Mason Attack 3
20056 b0v4tRyF Rated 1.499696e+12 1.499697e+12 109 resign white 10+0 marcodisogno 1360 jamboger 1227 e4 d6 d4 Nf6 e5 dxe5 dxe5 Qxd1+ Kxd1 Nd5 c4 Nb... B07 Pirc Defense 4
20057 N8G2JHGG Rated 1.499643e+12 1.499644e+12 78 mate black 10+0 jamboger 1235 ffbob 1339 d4 d5 Bf4 Na6 e3 e6 c3 Nf6 Nf3 Bd7 Nbd2 b5 Bd3... D00 Queen's Pawn Game: Mason Attack 3

20058 rows × 16 columns

Total amount of games played grouped by winner

The following graph shows a higher win rate for the white player, indicating a slight advantage. This is consistent with the theory of the "first-move advantage" in chess, which shapes how the two sides play throughout the game: the white player more commonly chooses an aggressive approach.

In [2]:
totalRated = df[df['rated'] == 'Rated'].shape[0]
ratedWinner = df[df['rated']=='Rated'].groupby('winner').size()/totalRated *100

totalUnrated = df[df['rated'] == 'Unrated'].shape[0]
unratedWinner = df[df['rated']=='Unrated'].groupby('winner').size()/totalUnrated *100
ratedWinner = ratedWinner.reset_index()
ratedWinner['rated'] = 'Rated'

unratedWinner = unratedWinner.reset_index()
unratedWinner['rated'] = 'Unrated'

winners = pd.concat([ratedWinner,unratedWinner])
winners = winners.rename(columns={0:'count'})
fig = px.histogram(winners, x="rated", y="count", color="winner", barmode="group",
                   title="Total amount of games played grouped by winner",
                   color_discrete_map={"white": "#f5cc5b", "draw": "orange", "black": "#260f05"})

# The x axis holds the game type (rated or unrated); the colour encodes the winner.
fig.layout["xaxis"]["title"] = "Game type"
fig.layout["yaxis"]["title"] = "Percentage"

fig.show()

One might expect unrated games to outnumber rated ones, as is usual in online games. Interestingly, Lichess users in this dataset gravitate towards rated games rather than unrated ones.
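The split between rated and unrated games can be checked directly with a count. The sketch below uses a small made-up frame (`games`, a hypothetical stand-in for the real dataset) and also shows `pd.crosstab` with `normalize="index"` as one compact way to obtain per-group percentages like the ones plotted above.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
games = pd.DataFrame({
    "rated": ["Rated", "Rated", "Rated", "Unrated"],
    "winner": ["white", "black", "white", "draw"],
})

# Number of games per game type.
counts = games["rated"].value_counts()

# Share of each outcome within a game type (each row sums to 100).
shares = pd.crosstab(games["rated"], games["winner"], normalize="index") * 100

print(counts)
print(shares.round(1))
```

On the real data, `df` would take the place of `games`.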

Total amount of games played grouped by endgame status

The high percentage of resignations suggests that the average player tends to give up once the position becomes difficult to hold.

In [3]:
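# Work on copies: later cells add columns to `rated`, which should not
# write into a slice of `df`.
rated = df.loc[df["rated"] == 'Rated'].copy()
unrated = df.loc[df["rated"] == 'Unrated'].copy()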
totalRated = df[df['rated'] == 'Rated'].shape[0]
ratedWinner = df[df['rated']=='Rated'].groupby('victory_status').size()/totalRated *100

totalUnrated = df[df['rated'] == 'Unrated'].shape[0]
unratedWinner = df[df['rated']=='Unrated'].groupby('victory_status').size()/totalUnrated *100
ratedWinner = ratedWinner.reset_index()
ratedWinner['rated'] = 'Rated'


unratedWinner = unratedWinner.reset_index()
unratedWinner['rated'] = 'Unrated'


winners = pd.concat([ratedWinner,unratedWinner])
winners = winners.rename(columns={0:'count'})

fig = px.histogram(winners, x="rated", y='count', color="victory_status", barmode="group", title="Total amount of games played grouped by endgame status")

# The x axis holds the game type; the colour encodes the victory status.
fig.layout["xaxis"]["title"] = "Game type"
fig.layout["yaxis"]["title"] = "Percentage"

fig.show()

Probability of an upset given elo difference brackets

Elo (the Elo rating system) is a way of rating players within a game system. It takes into account each win and loss, as well as the overall number of games played. An upset is defined as a game in which the lower-rated player beats the higher-rated player. By dividing the data into brackets based on the difference between the two players' ratings, we can see that the larger the difference, the less likely an upset becomes. Of course, given how the matchmaking algorithm works, players with a large Elo difference are less likely to be matched together. The number of games in the higher brackets is therefore lower, which makes the plot less precise there.
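For intuition about why upsets become rarer as the gap grows, the standard Elo expected-score formula can be evaluated for a few rating gaps. This is only an illustrative sketch: the expected score counts a draw as half a point, so it is not exactly an upset probability, and Lichess computes its ratings with Glicko-2 rather than plain Elo. The function name `expected_score` and the base rating of 1500 are choices made for this example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


for gap in (0, 100, 200, 400, 800):
    # Expected score of the lower-rated player, who is `gap` points behind.
    print(gap, round(expected_score(1500 - gap, 1500), 3))
```

The value shrinks steadily as the gap widens, which matches the downward trend expected in the upset plot.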

In [4]:
# An upset is a win by the lower-rated player. Draws and games between
# equally rated players are not counted as upsets.
white_upset = (rated["winner"] == "white") & (rated["white_rating"] < rated["black_rating"])
black_upset = (rated["winner"] == "black") & (rated["black_rating"] < rated["white_rating"])
rated["upset"] = white_upset | black_upset

rated['elo_difference'] = abs(rated['white_rating'] - rated['black_rating'])
rated['elo_interval'] = pd.cut(rated['elo_difference'],[0,100,200,300,400,500,600,700,800,900,1000,1100,1200,1300], include_lowest=True )

newDf = rated[rated['upset'] == True].groupby("elo_interval").size().reset_index().rename(columns={0: "count"})
tmp = rated.groupby("elo_interval").size().reset_index().rename(columns={0: "total"}).set_index('elo_interval')
newDf = newDf.set_index('elo_interval')
newDf = pd.concat([newDf,tmp], axis=1)
newDf['percentage'] = newDf['count']/newDf['total'] *100
newDf.index = newDf.index.astype(str)

fig = px.line(newDf,y='percentage' ,title="Probability of an upset given elo difference brackets")
fig.layout["xaxis"]["title"] = "Elo Difference"
fig.layout["yaxis"]["title"] = "Upset Percentage"

fig.show()

Most used opening types

In the context of chess, an opening refers to a strategy, or a set of strategies, used in the initial stages of the game. There are many types of opening, but it can be useful to divide them into two categories: aggressive and defensive. A player using an aggressive opening attempts to set the pace of the game while shutting down the opponent's attempts to regain control. A player using a defensive opening attempts to slow the game down and respond to the opponent's moves. As mentioned before, white tends to prefer aggressive openings, since it always moves first.

The following graph reports the 10 most used openings in rated games. The results are partly expected and partly surprising. The Sicilian Defense is a very popular opening, so it is not surprising to find it in first place. That said, it is a defensive opening, and it is chosen by black: it arises when black answers 1. e4 with 1... c5. Since black tends to prefer defensive openings, unlike white, we can surmise that its large share comes from black players choosing it very often.
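The opening types counted in this section are obtained from the `opening_name` column by keeping only the text before the first colon, which drops the variation name. A minimal sketch of that step on a few names taken from the table above (the `names` series is a hand-picked sample, not the full column):

```python
import pandas as pd

# A few opening names as they appear in the dataset preview.
names = pd.Series([
    "Slav Defense: Exchange Variation",
    "Nimzowitsch Defense: Kennedy Variation",
    "Philidor Defense",
    "Queen's Pawn Game: Mason Attack",
    "Queen's Pawn",
])

# Keep the part before the first colon; names without a colon stay unchanged.
families = names.str.split(":").str.get(0)
print(families.tolist())
```

Note that closely related names such as "Queen's Pawn Game" and "Queen's Pawn" remain separate categories under this rule.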

In [5]:
rated['opening_type'] = rated['opening_name'].str.split(':').str.get(0)
total_games = rated.shape[0]
openings = rated.groupby('opening_type').size().reset_index(name='count').sort_values(by='count',ascending=False)

openings.loc[~openings['opening_type'].isin(openings.head(10)['opening_type']), 'opening_type'] = 'Others'

fig = px.pie(openings, values='count', names='opening_type', title='Most used opening types')
fig.show()

Most used opening types - games including at least one player in the top 50

We ran the same analysis over a smaller dataset, this time consisting of games with players in the top 50. This means that at least one of the two players involved in the game was among the top 50 players by Elo rating.

In [6]:
# Collect every (player, rating, date) observation regardless of colour, so
# that players who only ever played one colour are included as well.
white = rated[['white_id', 'white_rating', 'last_move_at']].rename(
    columns={'white_id': 'player', 'white_rating': 'rating', 'last_move_at': 'date'})
black = rated[['black_id', 'black_rating', 'last_move_at']].rename(
    columns={'black_id': 'player', 'black_rating': 'rating', 'last_move_at': 'date'})
players = pd.concat([white, black])

# Keep each player's most recent game; its rating is their latest known rating.
top = players.sort_values('date').groupby('player').last()
top = top.rename(columns={'rating': 'last_rating'})

# Index: player id, ordered from the highest to the lowest latest rating.
top = top.sort_values('last_rating', ascending=False)
In [7]:
top50 = rated[(rated['white_id'].isin(top.head(50).index)) | (rated['black_id'].isin(top.head(50).index))]
total_games = top50.shape[0]
top_openings = top50.groupby('opening_type').size().reset_index(name='count').sort_values(by='count',ascending=False)

top_openings.loc[~top_openings['opening_type'].isin(top_openings.head(10)['opening_type']), 'opening_type'] = 'Others'

px.pie(top_openings, values='count', names='opening_type', title='Most used opening types - games including at least one player in the top 50')

Winners grouped by the 10 most used openings

A few interesting points:

  • The English Opening has the highest win rate for white among the most popular openings (~55%). This can be explained by the fact that it is a set of openings designed to deal with and counter the most common defenses employed by black.
  • The Sicilian Defense has the highest win rate for black. This is further evidence that black prefers defensive openings, and could indicate that most of the opening's heavy use comes from black players.
  • The King's Pawn Game ends up being the most balanced opening set in terms of win rate. This might be because it comprises many different openings of varying effectiveness, but it is an interesting contrast with offline chess, where the opening has a higher win rate for white (~54%). It is also interesting that in offline chess the King's Pawn Game is considered the most popular opening move, which goes against the data gathered on the website.
In [8]:
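# fill_value=0 keeps the totals valid for openings that never produced a given outcome.
opening_victory = rated[rated['opening_type'].isin(openings['opening_type'])].groupby(['opening_type', 'winner']).size().unstack(fill_value=0)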

opening_victory['total'] = opening_victory['black'] + opening_victory['white'] + opening_victory['draw']
opening_victory['white'] = (opening_victory['white'] / opening_victory['total'])*100
opening_victory['black'] = (opening_victory['black'] / opening_victory['total'])*100
opening_victory['draw'] = (opening_victory['draw'] / opening_victory['total'])*100

opening_victory = opening_victory.sort_values(by="white")
fig = px.bar(opening_victory, y=opening_victory.index, x=["white","draw", "black"], title="Winners grouped by the 10 most used openings", labels={
                "variable": "Winner"},
             color_discrete_map={
                "white": "#f5cc5b", "draw": "orange", "black":"#260f05"
            }
            )
fig.layout["xaxis"]["title"] = "Percentage of victories"
fig.layout["yaxis"]["title"] = "Opening Type"
fig.show()

Game moves number grouped by Elo Difference

In [9]:
rated['elo_interval'] = pd.cut(rated['elo_difference'],[0,200,400,600,800,1000,1300], include_lowest=True )
newDf = rated.sort_values(by='elo_interval')
newDf['elo_interval'] = newDf['elo_interval'].astype(str)
fig = px.box(newDf, x='elo_interval', y="turns", title="Game moves number grouped by Elo Difference")
fig.layout["xaxis"]["title"] = "Elo Difference"
fig.layout["yaxis"]["title"] = "Moves"

fig.show()

Most frequently used pieces

The following set of graphs shows an interesting relation. The rook is the least used piece in the opening stage, but its use rises sharply afterwards. Given the frequency of castling (a move in which the king moves two squares towards a rook and that rook jumps over to the king's other side), we can assume that the rook is mostly kept back to protect the king and, also because of its starting position, only rarely moved in the opening stage.

Standard chess notation:

  • N: Knight
  • B: Bishop
  • R: Rook
  • Q: Queen
  • K: King
  • O-O, O-O-O: Castling (kingside, queenside)
  • none : Pawn

Castling involves the king and a rook. It serves as a way to further protect the king while allowing the rook to take a more active role in the game. Among other conditions, castling is only possible if neither the king nor the rook involved has moved.
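The notation above lets a move be attributed to a piece from its first character alone. The sketch below illustrates the idea on a hand-written list of moves; the names `PIECES` and `piece_of` are introduced only for this example.

```python
# First character of a move in standard notation -> piece name.
PIECES = {
    "K": "King",
    "Q": "Queen",
    "B": "Bishop",
    "N": "Knight",
    "R": "Rook",
    "O": "Castling",  # covers both O-O and O-O-O
}


def piece_of(move: str) -> str:
    """Return the piece that a move in standard notation refers to."""
    # Pawn moves start with a lowercase file letter (a-h), so anything
    # without a known uppercase prefix is treated as a pawn move.
    return PIECES.get(move[0], "Pawn")


moves = "e4 e5 Nf3 Nc6 Bb5 a6 O-O bxc3 Qe7 Kh8 O-O-O".split()
print([piece_of(m) for m in moves])
```

Because the lookup is case-sensitive, a pawn capture such as `bxc3` is not confused with a bishop move such as `Bb5`.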

In [10]:
# Join the games with a space before splitting: summing the strings directly
# would glue the last move of one game to the first move of the next.
all_moves = " ".join(rated["moves"]).split()
all_moves_df = pd.DataFrame({"move": all_moves})
most_used_moves = all_moves_df.groupby("move").size().reset_index(name='count').set_index("move").sort_values(by='count', ascending=False)
fig = px.bar(most_used_moves.head(10), x='count', title="Most frequently used moves")
fig.layout["xaxis"]["title"] = "Count"
fig.layout["yaxis"]["title"] = "Move"
fig.show()
In [11]:
def addPiece(df):
    df['piece'] = 'Pawn'
    df.loc[df.move.str.startswith('K'), ['piece']] = 'King'
    df.loc[df.move.str.startswith('Q'), ['piece']] = 'Queen'
    df.loc[df.move.str.startswith('B'), ['piece']] = 'Bishop'
    df.loc[df.move.str.startswith('N'), ['piece']] = 'Knight'
    df.loc[df.move.str.startswith('R'), ['piece']] = 'Rook'
    df.loc[df.move.str.startswith('O'), ['piece']] = 'Castling'

Interestingly enough, the order shown in the graph follows the natural order of the pieces' values (Queen 9, Rook 5, Bishop 3, Knight 3, Pawn 1), except for the king and the rook, which are skewed due to castling.

In [12]:
addPiece(all_moves_df)
data = all_moves_df.groupby('piece').size().reset_index(name='count')
total_moves = len(all_moves)
data = data.set_index('piece')

castlings = data.loc['Castling']['count']
kings = data.loc['King']['count']
rooks = data.loc['Rook']['count']

data.loc[data.index == 'King',['count']] = kings + castlings
data.loc[data.index == 'Rook',['count']] = rooks + castlings
data = data.drop('Castling').sort_values(by='count')

fig = px.bar(data, y=(data['count']/total_moves) *100, title="Most frequently used pieces")
fig.layout["xaxis"]["title"] = "Piece"
fig.layout["yaxis"]["title"] = "Percentage"
fig.show()
In [13]:
split_moves = pd.DataFrame(rated.moves.str.split()) #may take some time
split_moves['turns'] = rated['turns']
split_moves['opening_ply'] = rated['opening_ply']

opening_moves = split_moves[['moves' ,'opening_ply']].apply(lambda x: x['moves'][:x['opening_ply']], axis=1)
opening_moves = pd.DataFrame(opening_moves)
opening_moves = opening_moves.rename(columns={0: "moves"})
opening_moves = opening_moves.moves.sum()

all_opening_moves = pd.DataFrame(opening_moves)
all_opening_moves = all_opening_moves.rename(columns={0: "move"})
addPiece(all_opening_moves)


data = all_opening_moves.groupby('piece').size().reset_index(name='count')

data = data.set_index('piece')

castlings = data.loc['Castling']['count']
rooks = data.loc['Rook']['count']

data.loc[data.index == 'Rook',['count']] = rooks + castlings
data = data.drop('Castling').sort_values(by='count')

total_moves = len(opening_moves)
fig = px.bar(data, y=(data['count']/total_moves) *100, title="Most frequently used pieces in openings")
fig.layout["xaxis"]["title"] = "Piece"
fig.layout["yaxis"]["title"] = "Percentage"
fig.show()